## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
Most wines have a quality of 5 or 6 (1457 and 2198 wines, respectively). Only 5 wines have a quality of 9.
These three plots show the acid-related variables, and their shapes are similar: each has a single peak in the middle. This seems natural for winemaking, since too much acid makes the taste too sour and too little may make the wine lose its taste.
The histogram above shows the amount of residual sugar. I can observe one peak and a long tail. Residual sugar is a continuous value, so I should transform the long-tailed data.
After the transformation, the distribution appears to be bimodal. I infer that there are two types of wine: “sweet” and “not sweet”.
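The effect of such a transformation can be sketched as follows. This is only an illustration: it assumes a log10 transform (the text above does not say which transform was used), and `residual_sugar` is simulated data with made-up parameters standing in for `wine$residual.sugar`.

```r
# Simulated stand-in for wine$residual.sugar: a mix of drier and sweeter wines.
# The mixture sizes and parameters are hypothetical, chosen for illustration.
set.seed(42)
residual_sugar <- c(rlnorm(600, meanlog = log(1.5), sdlog = 0.3),
                    rlnorm(400, meanlog = log(10),  sdlog = 0.4))

# Raw scale: one visible peak and a long tail.
hist(residual_sugar, breaks = 40,
     main = "Residual sugar", xlab = "g/dm^3")

# log10 scale (assumed transform): the two groups show up as two modes.
log_sugar <- log10(residual_sugar)
hist(log_sugar, breaks = 40,
     main = "log10(residual sugar)", xlab = "log10(g/dm^3)")
```

On the real data, the same two `hist()` calls on `wine$residual.sugar` would show whether the bimodal shape holds.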
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
The amount of salt (chlorides) is almost the same in every wine: the 1st quartile is 0.036 \(g/dm^3\) and the 3rd quartile is 0.05 \(g/dm^3\). The standard deviation is 0.021848.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
## [1] 17.00714
Half of the wines have a free sulfur dioxide value between 23.00 and 46.00 (the 1st and 3rd quartiles).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
## [1] 42.49806
75% of the wines have a total sulfur dioxide value below 167.0. In absolute terms the distribution is wider than that of free sulfur dioxide (standard deviation 42.5 versus 17.0). I think this is because of the nature of free sulfur dioxide: when a lot of sulfur dioxide is present, part of it binds with sugars or other chemicals. Therefore, the distribution of free sulfur dioxide is narrow compared with that of total sulfur dioxide.
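Because the two variables have very different means, it can help to compare their spread relative to the mean as well. The sketch below computes the coefficient of variation (standard deviation divided by mean) from the summary values printed above; the variable names are introduced here only for illustration.

```r
# Means and standard deviations as reported in the output above.
free_mean  <- 35.31
free_sd    <- 17.00714
total_mean <- 138.4
total_sd   <- 42.49806

# Coefficient of variation: spread relative to the mean.
cv_free  <- free_sd / free_mean
cv_total <- total_sd / total_mean

round(c(free = cv_free, total = cv_total), 2)
```

In absolute terms total sulfur dioxide varies more, but relative to its mean free sulfur dioxide varies more, so "narrow" here refers to the absolute scale.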
This distribution is understandable: the main component of wine is water, whose density is 1, so the density of wine is around 1. Since the density of alcohol is below 1, I think that is why most wines have a density below 1.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
This distribution is interesting to me. It is wide, and pH may be an important factor in a wine’s taste, which in turn affects its quality. I’ll take a closer look at this.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Sulphates are a wine additive that can contribute to sulfur dioxide gas (SO2) levels; sulfur dioxide acts as an antimicrobial and antioxidant. So I have to check the relationship between sulphates and free or total sulfur dioxide.
From here on, I define three new variables that are needed to examine the dataset: “class”, “bound.sulfur.dioxide” and “ratio”. I’ll describe these variables in the Univariate Analysis section.
## 'data.frame': 4898 obs. of 16 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## $ class : Factor w/ 3 levels "bad","normal",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ bound.sulfur.dioxide: num 125 118 67 139 139 67 106 125 118 101 ...
## $ ratio : num 0.265 0.106 0.309 0.253 0.253 ...
This dataset contains 4898 observations of 11 input variables and one output (quality), plus a row index X. The input variables are fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates and alcohol. All of the original variables are numeric. Most of the wines are quality 5 or 6; high-quality (9) and low-quality (3) wines are rare.
The main feature of interest is the quality of the wine. I’m interested in which factors matter for a wine to be rated as high quality.
I expect alcohol, total sulfur dioxide, free sulfur dioxide and volatile acidity to affect the quality of wine. After researching information on wine, I think alcohol contributes most to quality.
Since there are few wines of quality 9 or quality 3, it is better to group the quality scores. I classified them into three classes: “bad” (quality 3, 4 or 5), “normal” (quality 6 or 7) and “good” (quality 8 or 9).
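One way to build such a grouping is with `cut()`. This is a minimal sketch rather than the code actually used for the report; `wine_demo` is a hypothetical data frame with one wine per quality score, and the break points assume "bad" is 3 to 5, "normal" is 6 to 7 and "good" is 8 to 9.

```r
# Hypothetical data frame with one wine per quality score.
wine_demo <- data.frame(quality = 3:9)

# cut() uses right-closed intervals by default: (2,5], (5,7], (7,9].
wine_demo$class <- cut(wine_demo$quality,
                       breaks = c(2, 5, 7, 9),
                       labels = c("bad", "normal", "good"))

table(wine_demo$class)
```

Because the intervals are right-closed, a quality of 5 falls in "bad" and a quality of 7 falls in "normal".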
I created “bound sulfur dioxide”, which is defined as follows. \[ Bound\ SO_2 = Total\ SO_2 - Free\ SO_2 \] I also created “ratio”, the ratio of free sulfur dioxide to total sulfur dioxide.
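The two derived variables can be computed directly from the existing columns. The sketch below uses the first three rows shown in the `str()` output above; `wine_demo` is a hypothetical stand-in for the full `wine` data frame.

```r
# First three rows of the two sulfur dioxide columns, copied from str() above.
wine_demo <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30),
  total.sulfur.dioxide = c(170, 132, 97)
)

# Bound SO2 = Total SO2 - Free SO2
wine_demo$bound.sulfur.dioxide <-
  wine_demo$total.sulfur.dioxide - wine_demo$free.sulfur.dioxide

# Ratio of free to total SO2
wine_demo$ratio <-
  wine_demo$free.sulfur.dioxide / wine_demo$total.sulfur.dioxide

wine_demo
```

The results agree with the `bound.sulfur.dioxide` (125, 118, 67) and `ratio` (0.265, 0.106, 0.309) values listed in the `str()` output.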
There are no missing data in this dataset. I found many outliers, but I don’t think they are erroneous, so I left those data points unchanged.
Which chemical properties influence the quality of white wines? To begin with, I assume that a good wine is one with quality 8 or 9 and a bad wine is one with quality 3, 4 or 5, as mentioned above.
I tried all the plots, and the graph below is the most informative. The median alcohol concentration of “good” wines is conspicuously high. This suggests that good wines tend to have a high alcohol concentration in common, but that alone is not enough. Our sense of taste is sensitive: high alcohol helps, but it is not sufficient. So what other chemical properties influence the quality of wine?
The quantiles of alcohol in good wines:
## 0% 25% 50% 75% 100%
## 8.5 11.0 12.0 12.6 14.0
The quantiles show that 75% of good wines have an alcohol concentration of at least 11.0%. So from here on I consider only wines whose alcohol concentration is 11.0% or higher.
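The threshold and the filtering step can be sketched as below. The data frame `wine_demo` is hypothetical: its "good" rows reuse the five quantile values printed above and the other rows are made up, so only the mechanics carry over to the real data.

```r
# Hypothetical rows standing in for the wine data frame.
wine_demo <- data.frame(
  alcohol = c(8.5, 11.0, 12.0, 12.6, 14.0, 9.5, 10.1, 11.4),
  class   = factor(c("good", "good", "good", "good", "good",
                     "bad", "normal", "normal"),
                   levels = c("bad", "normal", "good"))
)

# Quantiles of alcohol among the "good" wines.
good_alcohol <- wine_demo$alcohol[wine_demo$class == "good"]
quantile(good_alcohol)

# The 25% quantile: 75% of good wines are at or above this value.
threshold <- unname(quantile(good_alcohol, 0.25))

# Keep only the wines at or above the threshold, whatever their class.
high_alcohol <- subset(wine_demo, alcohol >= threshold)
nrow(high_alcohol)
```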
This graph shows that “bad” wines tend to have a low amount of free sulfur dioxide. This surprised me: sulfur dioxide is added to prevent oxidation, and I had assumed that a large amount is bad for our health and a small amount is better. However, judging from this graph, I guess that bad wines are vulnerable to oxygen, and that may affect their taste.
In contrast, the amount of bound sulfur dioxide in “good” wines is lower than in “bad” wines. Bound sulfur dioxide is, as the name suggests, sulfur dioxide that has combined with other chemicals such as sugars or acids. It has no (or a weaker) antioxidant effect, so a high amount of bound sulfur dioxide may serve no purpose or may even affect the taste of the wine. A high free/total sulfur dioxide ratio is therefore meaningful, and I infer that good wines will have a high ratio.
This graph shows that “good” wines are likely to have a high free/total sulfur dioxide ratio. This is also one of the factors that make a “good” wine.
I’m curious whether a high amount of free sulfur dioxide affects the pH, since free sulfur dioxide takes part in the equilibrium below. \[H_2O + SO_{2} \Leftrightarrow HSO_{3}^- + H^+ \] This releases \(H^+\) and so could affect the pH, so I plotted the graph.
##
## Pearson's product-moment correlation
##
## data: free.sulfur.dioxide and pH
## t = 2.5176, df = 1717, p-value = 0.01191
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.01340625 0.10761664
## sample estimates:
## cor
## 0.06064651
From this graph, I can’t conclude that free sulfur dioxide affects the pH. To verify this, I calculated the correlation. The result is 0.06. Although the p-value (0.012) is below 0.05, the correlation is so weak that there is practically no linear relationship between the two variables. (Free sulfur dioxide is measured in mg/dm^3, whereas fixed acidity is measured in g/dm^3. Therefore, the other acids have a much stronger effect on the pH.)
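To see why a statistically significant result can still mean "practically no relationship", the reported numbers can be reworked by hand. The sketch below takes `cor` and `df` from the `cor.test` output above and recomputes the t statistic, the p-value and the share of variance explained; the variable names are chosen here for illustration.

```r
# Values copied from the cor.test output above.
r  <- 0.06064651
df <- 1717

# t statistic of a Pearson correlation: t = r * sqrt(df / (1 - r^2))
t_stat <- r * sqrt(df / (1 - r^2))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

# Share of the variance in pH that a linear fit on free SO2 would explain.
r_squared <- r^2

c(t = t_stat, p = p_value, r_squared = r_squared)
```

With more than 1700 observations even a tiny correlation is significant, but an `r_squared` below 1% says that free sulfur dioxide explains almost none of the variation in pH.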
Are there any other factors? I plotted the graphs below. (I intentionally drop the alcohol >= 11.0 condition in the graphs and analysis below.)
“Good” wines have a lower density than other wines. I think this is related to the high alcohol concentration in “good” wines: the density of alcohol is lower than 1, which lowers the density of the wine.
Clearly there is a relationship between density and alcohol: a higher alcohol concentration goes with a lower density.
##
## Pearson's product-moment correlation
##
## data: wine$alcohol and wine$density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7908646 -0.7689315
## sample estimates:
## cor
## -0.7801376
The correlation (-0.78) also shows that there is a strong negative relationship.
In this part, I observed the relationship between class and alcohol. “Good” wines have a high alcohol concentration compared with the others, so this is one of the factors that make a good wine. However, not all wines with a high alcohol concentration are good. Therefore, I also checked the other factors under the high-alcohol condition. The results are that good wines also tend to have high free sulfur dioxide and a high free/total sulfur dioxide ratio.
I also looked into the relationship between alcohol concentration and density. Since good wines have high alcohol, I expected the density of high-alcohol wines to be low, and this turned out to be true. Alcohol and density are strongly related to each other.
The relationship between class and alcohol is the strongest of all. High alcohol concentration is the most important factor of all. The analysis above was done under the following condition: an alcohol concentration of at least 11.0%.
In the analysis above, I could differentiate between “good” and “bad” wines. However, judging from the graphs above, I couldn’t differentiate between “good” and “normal” wines. So in this chapter, I try to find the main difference between “good” and “normal” wines. The analysis above was done under the condition that the alcohol concentration is at least 11.0%, and under that condition there seems to be no difference between “good” and “normal” wines.
I could see small differences in pH, ratio (defined above), sulphates and residual.sugar. So I’ll plot these factors using a facet grid.
The first graph shows a small difference in ratio: good wines (quality 8 or 9) have a slightly higher ratio of free sulfur dioxide. From these plots, I couldn’t find a meaningful difference between “normal” and “good” wines. Therefore, I conclude that the only difference between “normal” and “good” wines is in alcohol concentration.
In the previous analysis section, I could show the difference between “bad” and “good” wines. However, I failed to tell “normal” and “good” wines apart under the high-alcohol condition. Therefore, in this section I tried to show the difference between “normal” and “good” wines without using alcohol concentration. Even after plotting all the variables that seemed meaningful, I couldn’t see any difference between “normal” and “good” wines. This means the difference exists only in alcohol.
In the Multivariate Plots section, I showed that there is no difference between “normal” and “good” wines if I don’t use the alcohol condition.
I tried to build a decision tree model, once without cross-validation and once with cross-validation. I also tried to fit a linear model to this dataset.
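# Load the packages used below: rpart for the tree, rattle for fancyRpartPlot
library(rpart)
library(rattle)
# Set the random seed for reproducibility
set.seed(1)
n <- nrow(wine)
# Shuffle the data set
shuffled <- wine[sample(n), ]
# Divide into training and test sets (60:40)
train <- shuffled[1:round(0.6 * n), ]
test <- shuffled[(round(0.6 * n) + 1):n, ]
# Fit the decision tree model (classification)
tree <- rpart(quality ~ ., train, method = "class")
# Predict the quality of the test wines
pred <- predict(tree, test, type = "class")
# Evaluate the model: accuracy = correct predictions / all predictions
conf <- table(test$quality, pred)
sum(diag(conf)) / sum(conf)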
## [1] 0.5880551
# Decision Tree
fancyRpartPlot(tree)
Without cross-validation, the decision tree model reaches an accuracy of 58.8% on the test set.
# Initialize the accs vector
accs <- rep(0, 6)
for (i in 1:6) {
  # These indices indicate the interval of the test set
  indices <- (((i - 1) * round((1 / 6) * nrow(shuffled))) + 1):((i * round((1 / 6) * nrow(shuffled))))
  # Exclude them from the train set
  train_cross <- shuffled[-indices, ]
  # Include them in the test set
  test_cross <- shuffled[indices, ]
  # A model is learned using each training set
  tree_cross <- rpart(quality ~ ., train_cross, method = "class")
  # Make a prediction on the test set using tree_cross
  pred_cross <- predict(tree_cross, test_cross, type = "class")
  # Assign the confusion matrix to conf_cross
  conf_cross <- table(test_cross$quality, pred_cross)
  # Assign the accuracy of this model to the ith index in accs
  accs[i] <- sum(diag(conf_cross)) / sum(conf_cross)
}
# Print out the mean of accs
mean(accs)
## [1] 0.595384
#Decision Tree
fancyRpartPlot(tree_cross)
With cross-validation, the decision tree model reaches a mean accuracy of 59.5%. This is slightly higher than the single-split estimate of 58.8%, and it is a more reliable estimate because every fold is used once as the test set.
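An alternative way to form the folds is to assign a fold number to every row, which guarantees that each row is tested exactly once even when the number of rows is not a multiple of the number of folds. This is a sketch under stated assumptions, not the code used for the results above: `demo` is simulated data with a made-up two-level outcome, and `fold_id` is a name introduced here.

```r
library(rpart)

set.seed(1)
# Simulated stand-in for the wine data; the relationship is made up.
n <- 300
demo <- data.frame(alcohol = runif(n, 8, 14),
                   density = runif(n, 0.99, 1.00))
demo$quality <- factor(ifelse(demo$alcohol + rnorm(n, sd = 0.8) > 11,
                              "high", "low"))

k <- 6
# Give every row a fold number 1..k, then shuffle the assignment.
fold_id <- sample(rep(seq_len(k), length.out = n))

accs <- numeric(k)
for (i in seq_len(k)) {
  train_cv <- demo[fold_id != i, ]   # all folds except the i-th
  test_cv  <- demo[fold_id == i, ]   # the i-th fold
  fit  <- rpart(quality ~ ., data = train_cv, method = "class")
  pred <- predict(fit, test_cv, type = "class")
  accs[i] <- mean(pred == test_cv$quality)   # accuracy on this fold
}

# Cross-validated accuracy estimate
mean(accs)
```

Computing the accuracy as `mean(pred == test_cv$quality)` also avoids relying on the confusion matrix being square when a class is missing from a fold.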
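# Fit a linear model of quality on all other columns
# (the result is not named `lm`, to avoid masking the lm() function)
fit_lm <- lm(quality ~ ., data = wine)
summary(fit_lm)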
##
## Call:
## lm(formula = quality ~ ., data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.83559 -0.44345 -0.00508 0.39949 1.67588
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.442e+01 1.476e+01 5.721 1.12e-08 ***
## X -4.955e-05 6.648e-06 -7.453 1.07e-13 ***
## fixed.acidity 3.005e-02 1.662e-02 1.808 0.0707 .
## volatile.acidity -1.210e+00 9.066e-02 -13.342 < 2e-16 ***
## citric.acid -1.174e-01 7.529e-02 -1.559 0.1191
## residual.sugar 4.792e-02 5.932e-03 8.077 8.28e-16 ***
## chlorides -4.512e-01 4.276e-01 -1.055 0.2913
## free.sulfur.dioxide -1.216e-02 1.985e-03 -6.126 9.74e-10 ***
## total.sulfur.dioxide 2.604e-03 5.475e-04 4.756 2.03e-06 ***
## density -8.449e+01 1.496e+01 -5.647 1.72e-08 ***
## pH 3.550e-01 8.394e-02 4.229 2.39e-05 ***
## sulphates 5.523e-01 7.874e-02 7.014 2.63e-12 ***
## alcohol 1.796e-01 1.900e-02 9.455 < 2e-16 ***
## classnormal 1.655e+00 4.636e-02 35.708 < 2e-16 ***
## classgood 3.445e+00 6.471e-02 53.234 < 2e-16 ***
## bound.sulfur.dioxide NA NA NA NA
## ratio 2.209e+00 2.845e-01 7.765 9.90e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5869 on 4882 degrees of freedom
## Multiple R-squared: 0.5622, Adjusted R-squared: 0.5608
## F-statistic: 417.9 on 15 and 4882 DF, p-value: < 2.2e-16
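The linear model has an adjusted R-squared of 0.561 (56.1%). R-squared is the share of variance explained, not a classification accuracy, so it is not directly comparable with the accuracies of the decision trees. The model also includes “class”, which is derived from quality. Among the classification results, the decision tree with cross-validation has the highest accuracy.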
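To put a linear model on the same footing as a classification tree, one option is to round its predictions to the nearest quality score and compute the accuracy on held-out data. The sketch below shows the mechanics on simulated data; `demo`, its coefficients and the 120/80 split are all hypothetical, so the numbers it prints say nothing about the wine data.

```r
set.seed(1)
# Simulated stand-in for the wine data; the relationship is made up.
n <- 200
demo <- data.frame(alcohol = runif(n, 8, 14))
demo$quality <- pmin(9, pmax(3, round(0.5 * demo$alcohol + rnorm(n, sd = 0.7))))

# Simple hold-out split (rows are already in random order).
train <- demo[1:120, ]
test  <- demo[121:n, ]

fit  <- lm(quality ~ alcohol, data = train)
pred <- predict(fit, newdata = test)

# Round to the nearest score and clamp to the 3..9 scale.
pred_class <- pmin(9, pmax(3, round(pred)))

# Accuracy: comparable with the decision tree accuracies.
accuracy <- mean(pred_class == test$quality)
# RMSE: the usual error measure for a regression model.
rmse <- sqrt(mean((pred - test$quality)^2))

c(accuracy = accuracy, rmse = rmse)
```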
This graph clearly shows the difference in the distribution of alcohol concentration. “Good” wines tend to have a higher alcohol concentration than “normal” and “bad” wines.
This graph shows the difference between “good” and “bad” wines at high alcohol concentration. “Bad” wines are likely to contain a small amount of free sulfur dioxide, which means those wines are vulnerable to oxygen.
This graph shows that high-quality wines (8, 9) tend to have a slightly higher ratio of free sulfur dioxide than other wines (3 to 7). The pH distribution doesn’t change much with the quality of the wine.
The white wine dataset contains almost 4900 observations of 11 variables and one output (quality). First, I tried to understand each variable, plotting its distribution along the way. Then I found that the significant difference between good and bad wines is in the alcohol. So I restricted the data to high-alcohol wines to look for other significant factors. I found that good wines have a high amount of free sulfur dioxide and a small amount of bound sulfur dioxide. Bad wines, on the other hand, have a small amount of free sulfur dioxide and a high amount of bound sulfur dioxide. I could differentiate bad and good wines. However, I had a hard time telling normal and good wines apart. There is a clear difference in alcohol concentration, but apart from that, I couldn’t find any difference between normal and good wines. Wine contains many ingredients, and all of them matter when we taste it. Our sense of taste is quite sensitive to them, so we may perceive tiny differences. That may be the reason why there is little measurable difference between normal and good wines.
If the data contained the vintage, the production area, the number of people who tasted each wine and the price, it would be more interesting to investigate the quality of wines. (However, since this information is privacy-sensitive, it is unlikely to be released.)